@stephengoldbaum

Summary

This PR introduces a new RDF ingestion source for DataHub, enabling ingestion of RDF/OWL ontologies (Turtle, RDF/XML, JSON-LD, N3, N-Triples) with a focus on business glossaries. The source extracts glossary terms, term hierarchies, and relationships from RDF files using standard vocabularies like SKOS, OWL, and RDFS.

What's New

Core Features

  • RDF Ingestion Source (type: rdf) - Native DataHub plugin for RDF/OWL ontologies
  • Multiple Format Support - Turtle, RDF/XML, JSON-LD, N3, N-Triples
  • Flexible Source Loading - Files, directories (with recursive option), URLs, and comma-separated file lists
  • Glossary Term Extraction - Converts skos:Concept and owl:Class to DataHub GlossaryTerms
  • Glossary Node Hierarchy - Auto-creates glossary nodes from IRI path hierarchies
  • Term Relationships - Extracts skos:broader and skos:narrower relationships as isRelatedTerms
  • Stateful Ingestion - Full support for stale entity removal via stateful_ingestion config
  • Platform Instance Support - Configurable platform instances via platform_instance config
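As a rough illustration of how glossary nodes could be derived from IRI path hierarchies, the sketch below splits a term IRI into segments and builds the chain of parent node paths. The function names, the choice to put the host first, and the "/" separator are assumptions made for this example; they are not taken from the PR's implementation.

```python
from urllib.parse import urlparse


def iri_to_path_segments(iri: str) -> list[str]:
    """Split an IRI into segments: host, then path parts, then fragment (if any)."""
    parsed = urlparse(iri)
    segments = [parsed.netloc] if parsed.netloc else []
    segments += [part for part in parsed.path.split("/") if part]
    if parsed.fragment:
        segments.append(parsed.fragment)
    return segments


def glossary_node_paths(iri: str) -> list[str]:
    """Return the parent node paths for a term IRI, outermost first.

    The last segment is treated as the term itself and is excluded.
    """
    parents = iri_to_path_segments(iri)[:-1]
    return ["/".join(parents[: i + 1]) for i in range(len(parents))]


# Hypothetical term IRI used only for illustration.
nodes = glossary_node_paths("http://example.org/finance/accounts/Customer")
```

Under these assumptions, each returned path would correspond to one glossary node, with every node parented by the path before it.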

Architecture

  • Modular Entity Processing - Clean separation with extractors, converters, and MCP builders
  • Dependency-Based Processing - Topological sort for correct entity processing order
  • Test Connection Support - Implements test_connection() for connection validation
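The dependency-based processing order can be pictured with the standard library's `graphlib`. This is a minimal sketch: the entity type names and the dependency map are hypothetical, and the PR may implement its topological sort differently.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each entity type lists the types it depends on.
dependencies: dict[str, set[str]] = {
    "glossary_node": set(),
    "glossary_term": {"glossary_node"},
    "relationship": {"glossary_term"},
}


def processing_order(deps: dict[str, set[str]]) -> list[str]:
    """Return entity types ordered so that dependencies come first.

    Raises graphlib.CycleError (a ValueError subclass) on circular dependencies.
    """
    return list(TopologicalSorter(deps).static_order())


order = processing_order(dependencies)
```

With this map, nodes would be processed before terms, and terms before relationships.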

Capabilities

The source supports the following DataHub capabilities:

Capability              | Status and notes
------------------------|---------------------------------------------------------------
Glossary Terms          | Enabled by default
Glossary Nodes          | Auto-created from IRI path hierarchies
Term Relationships      | Supports skos:broader and skos:narrower
Detect Deleted Entities | Requires stateful_ingestion.enabled: true
Platform Instance       | Supported via platform_instance config
Extract Descriptions    | Enabled by default (from skos:definition or rdfs:comment)
Data Domain             | Not applicable (domains used internally for hierarchy)
Dataset Profiling       | Not applicable
Extract Lineage         | Not in MVP
Extract Ownership       | Not supported
Extract Tags            | Not supported

Testing

Test Coverage

  • 126 unit tests - Comprehensive coverage of core functionality, error handling, and edge cases
  • 16 integration tests - End-to-end testing with golden file validation
  • Test scenarios include:
    • Simple glossary ingestion
    • Glossary with relationships
    • Glossary with domains
    • Multiple RDF formats (Turtle, RDF/XML, JSON-LD)
    • Recursive directory ingestion
    • Export filtering (export_only, skip_export)
    • Stateful ingestion with stale entity removal
    • Error handling (missing files, malformed RDF, invalid formats)
    • Large file performance warnings
    • Path traversal protection

Test Files

  • tests/unit/rdf/ - Unit tests for individual components
  • tests/integration/rdf/ - Integration tests with golden file validation
  • All tests passing ✅

Documentation

User Documentation

  • docs/sources/rdf/rdf.md - Comprehensive user guide (489 lines)
    • Quickstart guide
    • Configuration reference
    • RDF format and source types
    • Dialects and selective export
    • Stateful ingestion guide
    • Example RDF files
    • IRI-to-URN mapping
    • Glossary node hierarchy
    • Supported vocabularies
    • Limitations and troubleshooting

Recipe Examples

  • docs/sources/rdf/rdf_recipe.yml - Example recipes for basic and stateful ingestion

Integration Test Documentation

  • tests/integration/rdf/README.md - Detailed guide for running integration tests

Configuration Example

source:
  type: rdf
  config:
    source: ./glossary.ttl
    format: turtle
    environment: PROD
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true
    export_only:
      - glossary

Files Changed

Technical Notes

Security & Performance

  • URL Loading Security - Timeout limits, size limits, and redirect limits for safe URL loading
  • Path Traversal Protection - Configurable enforcement to prevent access outside intended directories
  • Memory Efficiency - Generator patterns for work unit generation and streaming for large URL downloads
  • Format Validation - Validates RDF formats before processing

Code Quality

  • Thread-Safe Registry - Entity registry uses double-checked locking pattern for thread safety
  • Component Validation - Validates registered components for entity type consistency
  • Type Safety - Complete type hints with proper forward references for MCP return types
  • Error Handling - Granular error reporting with structured logs and context
  • URN Generation - Standardized URN format using dot notation, proper encoding, and GUID fallback for non-ASCII characters
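For readers unfamiliar with double-checked locking, here is a minimal sketch of the pattern applied to a lazily created registry. The class body and the `get_registry` accessor are hypothetical stand-ins for illustration; the actual registry in this PR may be structured differently.

```python
import threading
from typing import Optional


class EntityRegistry:
    """Hypothetical stand-in for a registry mapping entity types to processors."""

    def __init__(self) -> None:
        self._processors: dict[str, object] = {}
        self._lock = threading.Lock()

    def register(self, entity_type: str, processor: object) -> None:
        with self._lock:
            self._processors[entity_type] = processor

    def get(self, entity_type: str) -> Optional[object]:
        with self._lock:
            return self._processors.get(entity_type)


_registry: Optional[EntityRegistry] = None
_registry_lock = threading.Lock()


def get_registry() -> EntityRegistry:
    """Create the shared registry once, even when called from many threads."""
    global _registry
    if _registry is None:  # first check, without taking the lock
        with _registry_lock:
            if _registry is None:  # second check, while holding the lock
                _registry = EntityRegistry()
    return _registry
```

The unlocked first check keeps the common path cheap once the registry exists, while the second check under the lock prevents two threads from both creating an instance.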

New Files

  • src/datahub/ingestion/source/rdf/ingestion/rdf_source.py - Main source implementation
  • src/datahub/ingestion/source/rdf/core/rdf_loader.py - RDF loading utilities with security
  • src/datahub/ingestion/source/rdf/core/urn_generator.py - URN generation with encoding
  • src/datahub/ingestion/source/rdf/entities/base.py - Base interfaces for entity processing
  • src/datahub/ingestion/source/rdf/entities/registry.py - Thread-safe entity registry
  • docs/sources/rdf/rdf.md - User documentation
  • docs/sources/rdf/rdf_recipe.yml - Recipe examples
  • tests/integration/rdf/test_rdf_source.py - Integration tests
  • tests/unit/rdf/ - Unit tests (multiple files)
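To make the URN generation described for urn_generator.py more concrete, the sketch below shows one way an IRI could be turned into a dot-notation glossary term URN with a hash-based fallback for non-ASCII names. Everything here is an assumption for illustration: the function name, the segment order, the percent-encoding rules, and the use of an MD5 hex digest as the "GUID" are not taken from the PR.

```python
import hashlib
from urllib.parse import quote, urlparse


def glossary_term_urn(iri: str) -> str:
    """Build an illustrative glossaryTerm URN from an IRI (hypothetical scheme)."""
    parsed = urlparse(iri)
    segments = [parsed.netloc] + [part for part in parsed.path.split("/") if part]
    if parsed.fragment:
        segments.append(parsed.fragment)
    name = ".".join(segments)
    if name.isascii():
        # Percent-encode anything outside letters, digits, and ".-_~".
        name = quote(name, safe=".-_")
    else:
        # Assumed fallback: a stable hash of the full IRI instead of the raw name.
        name = hashlib.md5(iri.encode("utf-8")).hexdigest()
    return f"urn:li:glossaryTerm:{name}"


# Hypothetical IRI used only for illustration.
urn = glossary_term_urn("http://example.org/finance/Customer")
```

A scheme like this keeps ASCII IRIs readable in the resulting URN while still producing a deterministic identifier for names that cannot be encoded cleanly.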

Modified Files

  • setup.py - Added RDF source to entry points (line 862)

Breaking Changes

None - This is a new feature addition with no breaking changes to existing functionality.

Support Status

The RDF source is marked as INCUBATING (SupportStatus.INCUBATING), indicating that it is ready for community adoption but may still change in minor ways in future releases based on feedback.

Checklist

  • Plugin registered in setup.py
  • Source class properly decorated (@platform_name, @config_class, @support_status)
  • Capability decorators added
  • Stateful ingestion implemented
  • test_connection() implemented
  • Comprehensive error handling
  • Security measures (timeouts, size limits, path traversal protection)
  • Memory-efficient patterns (generators, streaming)
  • Thread-safe registry
  • Type hints complete
  • All tests passing
  • User documentation complete
  • Integration test documentation
  • Code follows DataHub standards
  • Linting passes

@github-actions github-actions bot added the ingestion and product labels on Dec 19, 2025
@github-actions

Linear: ING-1308

@alwaysmeticulous

alwaysmeticulous bot commented Dec 19, 2025

✅ Meticulous spotted 0 visual differences across 951 screens tested.

Meticulous evaluated ~8 hours of user flows against your PR.

Last updated for commit 8c9e53d. This comment will update as new commits are pushed.

@codecov

codecov bot commented Dec 19, 2025

Bundle Report

Changes will increase total bundle size by 2.94kB (0.01%) ⬆️. This is within the configured threshold ✅

Detailed changes
Bundle name           | Size    | Change
----------------------|---------|-------------------
datahub-react-web-esm | 29.55MB | 2.94kB (0.01%) ⬆️

Affected Assets, Files, and Routes:

view changes for bundle: datahub-react-web-esm

Assets Changed:

Asset Name        | Size Change | Total Size | Change (%)
------------------|-------------|------------|-----------
assets/index-*.js | 2.94kB      | 19.38MB    | 0.02%

Files in assets/index-*.js:

  • ./src/app/ingestV2/source/builder/sources.json → Total Size: 39.44kB

  • ./src/app/ingestV2/source/builder/RecipeForm/constants.ts → Total Size: 12.11kB

  • ./src/app/ingestV2/source/builder/constants.ts → Total Size: 7.46kB

  • ./src/app/ingestV2/source/builder/RecipeForm/rdf.ts → Total Size: 2.86kB

….Class and RDFS.Class

- Updated the logic in `GenericDialect` to exclude ontology construct types while allowing OWL.Class and RDFS.Class to coexist with SKOS.Concept, enhancing compatibility with RDF standards.
- Updated the rdflib dependency in setup.py to specify a version range of >=6.0.0,<7.0.0, ensuring compatibility with existing RDF handling features.
- Updated the rdflib dependency in setup.py to specify an exact version of 6.3.2, ensuring compatibility with existing RDF handling features and preventing potential issues with future releases.
…er, and URN generator

- Introduced new unit tests for various edge cases in RDF dialects (Generic, FIBO, Default), including handling of empty graphs, missing labels, and special characters.
- Added tests for the RDF loader to cover format validation, file handling, URL loading, and zip file scenarios.
- Implemented edge case tests for the URN generator, focusing on IRI parsing and platform normalization.
- Enhanced overall test coverage to ensure robustness and reliability of RDF processing components.
… requests_file

- Modified the RDF plugin dependencies in setup.py to add specific versions of requests (2.32.5) and requests_file (3.0.1) alongside rdflib (6.3.2), ensuring compatibility and stability for RDF processing.
…n in documentation generation

- Added checks and warnings for missing platforms and plugins when processing README and documentation files, ensuring robustness in the documentation generation process.
- Improved logging to provide clearer feedback when encountering issues with platform or plugin names during the generation of custom documentation.
…tion generation

- Added a new metric for tracking missing capability data in the PluginMetrics class.
- Changed error logging to a warning when a plugin is not found in capability data, incrementing the new metric instead of the failed count.
- Updated exit behavior to only return an error code for actual failures, enhancing the robustness of the documentation generation process.
- Added new capabilities for ABS Data Lake, including support for containers, data profiling, and tags.
- Enhanced Athena source capabilities with additional features such as fine-grained lineage, schema metadata, and test connection.
- Updated platform details and support statuses for both ABS and Athena sources.
- Removed outdated entries and streamlined the capability structure for clarity.
…and URN generator

- Introduced extensive unit tests for the RDF loader, covering URL loading, error handling, and path traversal protection.
- Added tests for the EntityRegistry, focusing on registration methods, CLI name mapping, and processing order.
- Implemented tests for the URN generator, addressing IRI parsing, platform derivation, and structure preservation.
- Enhanced overall test coverage to ensure robustness and reliability of RDF processing components.
…ests

- Updated the test cases in `test_registry_comprehensive.py` to include type hints for the `EntityProcessor` instantiation, improving code clarity and type safety.
- Ensured consistency in the test setup for better maintainability and readability.
- Introduced extensive unit tests for the RDF loader, enhancing coverage for format validation, error handling, and file/URL detection.
- Added comprehensive tests for the URN generator, focusing on platform normalization, IRI parsing, and path derivation edge cases.
- Improved overall test coverage to ensure robustness and reliability of RDF processing components.
… tests

- Updated unit tests in `test_urn_generator_additional_coverage.py` to include type ignores for invalid type checks in platform normalization, IRI path derivation, and group name generation methods.
- Enhanced error handling assertions to maintain clarity while addressing type-related warnings from static analysis tools.

Labels

  • community-contribution - PR or Issue raised by member(s) of DataHub Community
  • ingestion - PR or Issue related to the ingestion of metadata
  • needs-review - Label for PRs that need review from a maintainer
  • product - PR or Issue related to the DataHub UI/UX
